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Abstract —Empirical auto-tuning and machine learning 
techniques have been showing high potential to improve 
execution time, power consumption, code size, reliability and 
other important metrics of various applications for more than 
two decades. However, they are still far from widespread 
production use due to lack of native support for auto-tuning in 
an ever changing and complex software and hardware stack, 
large and multi-dimensional optimization spaces, excessively 
long exploration times, and lack of unified mechanisms for 
preserving and sharing of optimization knowledge and research 
material. 

We present a possible collaborative approach to solve 
above problems using Collective Mind knowledge management 
system. In contrast with previous cTuning framework, this 
modular infrastructure allows to preserve and share through 
the Internet the whole auto-tuning setups with all related 
artifacts and their software and hardware dependencies besides 
just performance data. It also allows to gradually structure, 
systematize and describe all available research material including 
tools, benchmarks, data sets, search strategies and machine 
learning models. Researchers can take advantage of shared 
components and data with extensible meta-description to quickly 
and collaboratively validate and improve existing auto-tuning and 
benchmarking techniques or prototype new ones. The community 
can now gradually learn and improve complex behavior of all 
existing computer systems while exposing behavior anomalies 
or model mispredictions to an interdisciplinary community in 
a reproducible way for further analysis. We present several 
practical, collaborative and model-driven auto-tuning scenarios. 
We also decided to release all material at c-mind.org/repo to set 
up an example for a collaborative and reproducible research as 
well as our new publication model in computer engineering where 
experimental results are continuously shared and validated by the 
community. 

Keywords—high performance computing, systematic auto-tuning, 
systematic benchmarking, big data driven optimization, modeling 
of computer behavior, performance prediction, collaborative 
knowledge management, public repository of knowledge, No SQL 
repository, code and data sharing, specification sharing, 
collaborative experimentation, machine learning, data mining, 
multi-objective optimization, model driven optimization, agile 
development, plugin-based tuning, performance regression buildbot, 
open access publication model, reproducible research 


I. Introduction and related work 

Computer systems’ users are always eager to have faster, 
smaller, cheaper, more reliable and power efficient computer 
systems either to improve their every day tasks and quality 
of life or to continue innovation in science and technology. 
However, designing and optimizing such systems is becoming 
excessively time consuming, costly and error prone due to 
an enormous number of available design and optimization 
choices and complex interactions between all software and 
hardware components. Furthermore, multiple characteristics 
have to be carefully balanced including execution time, code 
size, compilation time, power consumption and reliability 
using a growing number of incompatible tools and techniques 
with many ad-hoc, intuition based heuristics. 

At the same time, development methodology for computer 
systems has hardly changed in the past decades: hardware is 
first designed and then the compiler is tuned for the new 
architecture using some ad-hoc benchmarks and heuristics. 
As a result, nearly peak performance of the new systems is 
often achieved only for a few previously optimized and not 
necessarily representative benchmarks while leaving most of 
the real user applications severely underperforming. Therefore, 
users are often forced to resort to a tedious and often 
non-systematic optimization of their programs for each new 
architecture. This, in turn, leads to an enormous waste of 
time, expensive computing resources and energy, dramatically 
increases development costs and time-to-market for new 
products and slows down innovation [1], [2], [3], [4]. 

Various off-line and on-line auto-tuning techniques together 
with run-time adaptation and split compilation have been 
introduced during the past two decades to address some of 
the above problems and help users automatically improve 
performance, power consumption and other characteristics of 
their applications. These approaches treat rapidly evolving 
computer system as a black box and explore program and 
architecture design and optimization spaces empirically [5], 
[6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], 
[18], [19], [20]. 

Since empirical auto-tuning was conceptually simple and 
did not require deep user knowledge about programs and 
computer systems, it quickly gained popularity. At the 
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Fig. 1. Rising number of optimization dimensions in GCC in the past 
12 years (boolean or parametric flags). Obtained by automatically 
parsing GCC manual pages, therefore small variation is possible 
(script was kindly shared by Yuriy Kashnikov). 
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Fig. 2. Number of distinct combinations of compiler optimizations 
for GCC 4.7.2 with a maximum achievable execution time speedup 
over -03 optimization level on Intel Xeon E5520 platform across 
285 shared Collective Mind benchmarks after 5000 random iterations 
(top graph) together with a number of benchmarks where these 
combinations achieve more than 10% speedup (bottom graph). 


same time, users immediately faced a fundamental problem: 
a continuously growing number of available design and 
optimization choices makes it impossible to exhaustively 
explore the whole optimization space. For example, Figure 1 
shows a continuously rising number of available boolean and 
parametric optimizations available in a popular, production, 
open-source compiler GCC used in practically all Linux and 
Android based systems. Furthermore, there is no more single 
combination of flags such as -03 or -Ofast that could deliver 
the best execution time across all user programs. Figure 2 
demonstrates 79 distinct combinations of optimizations for 
GCC 4.7.2 that improve execution time across 285 benchmarks 
with just one data set over -03 on Intel Xeon E5520 
based platform after 5000 explored solutions using traditional 
iterative compilation [21] (random selection of compiler 
optimization flags and parameters). 

Optimization space explodes even further when considering 
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Fig. 3. Execution time of a matrix-matrix multiply kernel when 
executed on CPU (Intel E6600) and on GPU (NVIDIA 8600 GTS) 
depending on the size N of square matrix as a motivation for online 
tuning and adaptive scheduling on heterogeneous architectures [22]. 


heterogeneous architectures, multiple data sets and fine grain 
program transformations and parameters including tiling, 
unrolling, inlining, padding, prefetching, number of threads, 
processor frequency and MPI communication [11], [23], [24], 
[21], [25], [26]. For example, Figure 3 shows execution time 
of a matrix-matrix multiplication kernel for square matrices on 
CPU (Intel E6600) and GPU (NVIDIA 8600 GTS), depending 
on their size. It motivates the need for adaptive scheduling 
since it may be beneficial either to run kernel on CPU or GPU 
depending on data set parameters (to amortize the cost of data 
transfers to GPU). However, as we show in [22], [27], [28], 
the final decision tree is architecture and kernel dependent 
and requires both off-line kernel cloning and some on-line, 
automatic and ad-hoc modeling of application behavior. 

Machine learning techniques (predictive modeling and 
classification) have been gradually introduced during the past 
decade as an attempt to address the above problems [29], 
[30], [31], [32], [33], [34], [35], [36], [22], [37]. These 
techniques can help speed up program and architecture 
analysis, optimization and co-design by narrowing down 
regions in large optimization spaces with the most likely 
highest speedup. They usually use prior training similar to 
Figure 2 in case of compiler tuning and predict optimizations 
for previously unseen programs based on some code, data set 
and system features. 

During the MILEPOST project in 2006-2009, we made 
the first practical attempt to move auto-tuning and machine 
learning to production compilers including GCC by combining 
a plugin-based compiler framework [38] and a public 
repository of experimental results (cTuning.org). This approach 
allowed to substitute and automatically learn default 
compiler optimization heuristics by crowdsourcing auto-tuning 
(processing a large amount of performance statistics collected 
from many users to classify application and build predictive 
models) [39], [21], [40]. However, this project exposed even 
more fundamental challenges including: 

• Lack of common, large and diverse benchmarks and data 
sets needed to build statistically meaningful predictive 
models; 
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• Lack of common experimental methodology and unified 
ways to preserve, systematize and share our growing 
optimization knowledge and research material including 
benchmarks, data sets, tools, tuning plugins, predictive 
models and optimization results; 

• Problem with continuously changing, “black box” 
and complex software and hardware stack with 
many hardwired and hidden optimization choices and 
heuristics not well suited for auto-tuning and machine 
learning; 

• Difficulty to reproduce performance results from the 
cTuning.org database submitted by users due to a lack 
of full software and hardware dependencies; 

• Difficulty to validate related auto-tuning and machine 
learning techniques from existing publications due to 
a lack of culture of sharing research artifacts with 
full experiment specifications along with publications in 
computer engineering. 

As a result, we spent a considerable amount of our 
“research” time on re-engineering existing tools or developing 
new ones to support auto-tuning and learning. At the same 
time, we were trying to somehow assemble large and diverse 
experimental sets to make our research and experimentation 
on machine learning and data mining statistically meaningful. 
We spent even more time when struggling to reproduce 
existing machine learning-based optimization techniques from 
numerous publications. 

Worse, when we were ready to deliver auto-tuning solutions 
at the end of such tedious developments, experimentation 
and validation, we were already receiving new versions of 
compilers, third-party tools, libraries, operating systems and 
architectures. As a consequence, our developments and results 
were already potentially outdated even before being released 
while optimization problems considerably evolved. 

We believe that these are major reasons why so many 
promising research techniques, tools and data sets for 
auto-tuning and machine learning in computer engineering 
have a life span of a PhD project, grant funding or publication 
preparation, and often vanish shortly after. Furthermore, we 
witness diminishing attractiveness of computer engineering 
often seen by students as “hacking” rather than systematic 
science. Some of the recent long-term research visions 
acknowledge these problems for computer engineering and 
many research groups search for “holy grail” auto-tuning 
solutions but no widely adopted solution has been found 
yet [2], [3]. 

In this paper, we describe the first, to our knowledge, 
alternative, orthogonal, community-based and big-data driven 
approach to address above problems. It may help make 
auto-tuning a mainstream technology based on our practical 
experience in the MILEPOST, cTuning and Auto-tune 
projects, industrial usage of our frameworks and community 
feedback. Our main contribution is a collaborative knowledge 
management framework for computer engineering called 
Collective Mind (or cM for short) that brings interdisciplinary 
researchers and developers together to organize, systematize, 
share and validate already available or new tools, techniques 
and data in a unified format with gradually exposed actions 


and meta-information required for auto-tuning and learning 
(optimization choices, features and tuning characteristics). 

Our approach should allow to collaboratively prototype, 
evaluate and improve various auto-tuning techniques while 
reusing all shared artifacts just like LEGO™pieces and 
applying machine learning and data mining techniques to 
find meaningful relations between all shared material. It 
can also help crowdsource long tuning and learning process 
including classification and model building among many 
participants while using Collective Mind as a performance 
tracking buildbot. At the same time, any unexpected program 
behavior or model mispredictions can now be exposed to the 
community through unified cM web-services for collaborative 
analysis, explanation and solving. This, in turn, enables 
reproducibility of experimental results naturally and as a 
side effect rather than being enforced - interdisciplinary 
community needs to gradually find and add missing software 
and hardware dependencies to the Collective Mind (fixing 
processor frequency, pinning code to specific cores to avoid 
contentions) or improve analysis and predictive models 
(statistical normality tests for multiple experiments) whenever 
abnormal behavior is detected. 

We hope that our approach will eventually help the 
community collaboratively evaluate and derive the most 
effective auto-tuning and learning strategies. It should also 
eventually help the community collaboratively learn complex 
behavior of all existing computer systems using top-down 
methodology originating from physics. At the same time, 
continuously collected and systematized knowledge (“big 
data”) should help us make more scientifically motivated 
advice about how to improve design and optimization of the 
future computer systems (particularly on our way towards 
extreme scale computing). Finally, we believe that it can 
naturally make computer engineering a systematic science 
while supporting Vinton G. Cerf’s recent vision [41]. 

This paper is organized as follows: the current section 
provides motivation for our approach and related work. 
It is followed by Section II presenting possible solution 
to collaboratively systematize and unify our knowledge 
about program optimization and auto-tuning using public 
Collective Mind framework and repository. Section III 
presents mathematical formalization of auto-tuning techniques. 
Section IV demonstrates how our collaborative approach can 
be combined with several existing plugin-based auto-tuning 
infrastructures including MILEPOST GCC [40], OpenME [42] 
and Periscope Tuning Framework (PTF) [26] to start 
systematizing and making practical various auto-tuning 
scenarios from our industrial partners including continuous 
benchmarking and comparison of compilers, validation of 
new hardware designs, crowdsourcing of program optimization 
using commodity mobile phones and tablets, automatic 
modeling of application behavior, model driven optimization 
and adaptive scheduling. It is followed by a section on 
reproducibility of experimental results in our approach 
together with a new publication model proposal where all 
research material is continuously shared and validated by the 
community. The last section includes conclusions and future 
work directions. 
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II. Cleaning up research and experimental mess 



Fig. 4. Converting (a) continuously evolving, ad-hoc, hardwired and 
difficult to maintain experimental setups to (b) interconnected cM 
modules (tool wrappers) with unified, dictionary-based inputs and 
outputs, data meta-description, and gradually exposed characteristics, 
tuning choices, features and a system state. 

Based on our long experience with auto-tuning and 
machine learning in both academia and industry, we now 
strongly believe that the missing piece of the puzzle 
to make these techniques practical is to enable sharing 
systematization, and reuse of all available optimization 
knowledge and experience from the community. However, our 
first attempt to crowdsource auto-tuning and machine learning 
using cTuning plugin-based framework and MySQL-based 
repository (cTuning) [40], [43] suffered from many orthogonal 
engineering issues. For example, we had to spend considerable 
effort to develop and continuously update ad-hoc research and 
experimental scenarios using many hardwired scripts and tools 
while being able to expose only a few dimensions, monitor a 
few characteristics and extract a few features. At the same 
time, we struggled with collection, processing and storing of 
a growing amount of experimental data in many different 
formats as conceptually shown in Figure 4a. Furthermore, 
adding a new version of a compiler or comparing multiple 
compilers at the same time required complex manipulations 
with numerous environment variables. 

Eventually, all these problems motivated the development 
of a modular Collective Mind framework and NoSQL 
heterogeneous repository [42], [44] to unify and preserve 
the whole experimental setups with all related artifacts and 
dependencies. First of all, to avoid invoking ad-hoc tools 
directly, we introduced cM modules which serve as wrappers 
around them to be able to transparently set up all necessary 
environment variables and validate all software and hardware 
dependencies before eventually calling these tools. Such an 
approach allows easy co-existence of multiple versions of 
tools and libraries while protecting experimental setups from 
continuous changes in the system. Furthermore, cM modules 
can now transparently monitor and unify all information flow 
in the system. For example, we currently monitor tools’ 


command line together with their input and output files 
to expose measured characteristics (behavior of computer 
systems), optimization and tuning choices, program, data set 
and architecture features, and a system state used in all 
our existing auto-tuning and machine learning scenarios as 
conceptually shown in Figure 4b. 

Since researchers are often eager to quickly prototype 
their research ideas rather than sink in low-language 
implementations, complex APIs and data structures that may 
change over time, we decided to use a researcher friendly and 
portable Python language as the main language in Collective 
Mind (though we also provide possibility to use any other 
language for writing modules through an OpenME interface 
described later in this paper in Section IV). Therefore, it is 
possible to run minimal cM on practically any Linux and 
Windows computer supporting Python. An additional benefit 
of using Python is a growing collection of useful packages for 
data management, mining and machine learning. 

We also decided to switch from traditional TXT, CSV 
and XML formats used in the first cTuning framework to a 
schema-free JSON data format [45] for all module inputs, 
outputs and meta-description. JSON is a popular, human 
readable and open standard format that represent data objects 
as attribute — value pairs. It is now backed up by many 
companies, supported by most of the recent languages and 
powerful search engines [46], and can be immediately used 
for web services and P2P communication during collaborative 
research and experimentation. Only when the format of data 
becomes stable or a research technique is validated, the 
community can provide data specification as will be described 
later in this paper. 

At the same time, we noticed that we can apply exactly 
the same concept of cM modules to systematize and describe 
any research and development material (code and data) while 
making sure that it can be easily found, reused and exposed to 
the Web. Researchers and developers can now categorize any 
collections of their files and directories by assigning an existing 
or adding a new cM module and moving their material to a new 
directory with a unique ID (UID) and an optional alias. Thus 
we can now abstract an access to highly heterogeneous and 
evolving material by gradually adding possible data actions and 
meta-description required for user’s research and development. 
For example, all cM modules have common actions to manage 
their data in a unified way similar to any repository such 
as add, list, view, copy, move and search. In addition, module 
code.source abstracts access to programs and has an individual 
action build to compile a given program. In fact, all current 
cM functionality is implemented as interconnected modules 
including kernel, core and repo that provide main low-level 
cM functions documented at c-mind.org/doxygen. 

In contrast with using SQL-based databases, our approach 
can help systematize, preserve and describe any heterogeneous 
code and data on any native file system without any need for 
specialized databases, pre-defined data schema and complex 
table restructuring as conceptually shown in Figure 5. Since 
cM modules also have their own UOA (UID or alias), it 
is now possible to easily reference and find any local user 
material similar to DOI by a unified Collective ID (CID) of 
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Category 

Third-party tools, 
libraries 

High-level algorithms 

Applications, 
benchmarks, kernels 

Multiple compilers 

Predictive models 

Binaries and libraries 

Program data sets 

Operating Systems 

Processors 

Statistical and data 
mining functions 

cM module Module actions 

All data 

Meta 

description 

) package j common*, install 



algorithm j common*, transform 


| code.source J common*, build : 

mm-m 

| ctuning.compiler j common*, compile_program 



if math.model | common ’.build, predict, fit, 

( ---- > detect_representative_points 


mm-m 

code | common*, run 



dataset j common*, create 


mm-m 

os j common*, detect_host_family 



[ processor 1 c ° mmon • 



| math.statistics.r j common*, analyze 

... 

mm-m 

* add, list, view, copy, move, search 



cM repository 
directory structure: .cmr 

/ module UOA (UID or alias) 

/ data UOA 

/ .cm / data.json 


Fig. 5. Gradually categorizing all available user artifacts using cM 
modules while making them searchable through meta-description and 
reusable through unified cM module actions. All material from this 
paper is shared through Collective Mind online live repository at 
c-mind.org/browse and c-mind.org/github-code-source. 


the following format: 

(cM module UOA):(cM data UOA). 

In addition, cM provides an option to transparently 
index meta-description of all artifacts using a third-party, 
open-source and JSON-based ElasticS earch framework 
implemented on Hadoop engine [46] to enable fast and 
powerful search queries for such a schema-free repository. 
Note that other similar NoSQL databases including MongoDB 
and CouchDB or even SQL-based repositories can be easily 
connected to cM to cache data or speed up queries, if 
necessary. 

Any cM module with a given action can be executed in a 
unified way using JSON format as both input and output either 
through cM command line front-end 
cm (;module UOA) (action) @input.json 
or using one Python function from a cM kernel module 
r=cm_kernel.access( {’ cm_run_module_uoa (module UOA), 

’ cm_action’.‘(action), action parameters }) 
or as a web service when running internal cM web server (also 
implemented as cM web. server module) using the following 
URL 

http://localhost:3333 ?cm_web_module_uoa - (module UOA) 
&cm_web_action=(action) ... . 

Lor example, a user can list all available programs in the 
system using cm code.source list and then compile a given 
program using 

cm code, source build 

work_dir_data_uoa=benchmark-cbench-security-blowfish 

build_target_os_uoa=windows-generic-64 . 

If some parameters or dependencies are missing, the cM 
module should be implemented as such to inform users about 
how to fix these problems. 

In order to simplify validation and reuse of shared 
experimental setups, we provide an option to keep tools 
inside a cM repository also together with a unified installation 


mechanism that resolves all software dependencies. Such 
packages including third-party tools and libraries can now 
be installed into a different entry in a cM repository with 
a unique IDs abstracted by cM code module. At the same 
time, OS-dependent script is automatically generated for each 
version of a package to set up appropriate environment 
including all paths. This script is automatically called before 
executing a given tool version inside an associated cM module 
as shown in Ligure 4. 

In spite of its relative simplicity, the Collective Mind 
approach helped us to gradually clean up and systematize 
our material that can now be easily searched, shared, reused 
or exposed to the web. It also helps substitute all ad-hoc 
and hardwired experimental setups with interconnected and 
unified modules and data that can be protected from continuous 
changes in computer systems and easily shared among 
workgroups. Users only need to categorize new material, move 
related files to a special directory of format .cmr/(module 
UOA)/(data UOA) (where .cmr is an acronym for Collective 
Mind Repository) to be automatically discovered and indexed 
by cM, and provide some meta-information in JSON format 
depending on research scenarios. 

In contrast with public web-based sharing services, 
we provide an open-source, technology-neutral, agile, 
customizable, and portable knowledge management system 
which allows both private and public systematization of 
research and experimentation. To initiate and demonstrate 
gradual and collaborative systematization of a research 
material for auto-tuning and machine learning, we decided 
to release all related code and data at c-mind.org/browse to 
discuss, validate and rank shared artifacts while extending 
their meta-description and abstract actions with the help 
of the community. We described and shared multiple 
benchmarks, kernels and real applications using cM 
code.source module, various data sets using cM dataset 
module, various parameterized classification algorithms and 
predictive models using cM math.model module, and many 
others. 

We also shared packages with exposed dependencies 
and installation scripts for many popular tools and 
libraries in our public cM repository at c-mind.org/repo 
including GCC, LLVM, ICC, Microsoft Visual Studio 
compilers, PGI compilers, Open64/PathScale compilers, ROSE 
source-to-source compilers, Oracle JDK, VTune, NVIDIA 
GPU toolkit, perf, gprof, GMP, MPFR, MPC, PPL, LAPACK, 
and many others. We hope that this will ease the burden of the 
community to continuously (re-)implement some ad-hoc and 
often unreleased experimental scenarios. In the next sections, 
we will show how this approach can be used to systematize 
and formalize auto-tuning. 

III. Formalizing auto-tuning and predictive 

MODELING 

Almost all research on auto-tuning can be formalized as 
finding a function of a behavior of a given user program B 
running on a given computer system with a given data set, 
selected design and optimization choices including program 
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transformations and architecture configuration c, and a system 
state s ([5], [6], [47], [40]): 

b = B{ c,s) 

For example, in our current and past research and 
experimentation, b is a behavior vector that includes execution 
time, power consumption, accuracy, compilation time, code 
size, device cost, and other important characteristics; c 
represents the available design and optimization choices 
including algorithm selection, the compiler and its 
optimizations, number of threads, scheduling, affinity, 
processor ISA, cache sizes, memory and interconnect 
bandwidth, etc; and finally s represents a state of the 
system including processor frequency and cache or network 
contentions. 

Knowing and minimizing this function is of a particular 
importance to our industrial partners when designing, 
validating and optimizing the next generation of new 
hardware and software including compilers for a broad 
range of customers’ applications, data sets and requirements 
(constraints), since it can help reduce time to market and cost 
for the new systems while increasing return on investment 
(ROI). However, the fundamental problem is that this function 
B is highly non-linear with a multi-dimensional discrete and 
continuous space of choices [11], [47] which is rarely possible 
to model analytically or evaluate empirically using exhaustive 
search unless really small kernels and libraries are used with 
just one or a few program transformations [5], [6]. 

This problem motivated research on automatic and empirical 
modeling of an associated function P that can quickly predict 
better design and optimization choices for a given computer 
system c based on some features (properties) of an end-users’ 
program, data set and a given hardware f, and a current state 
of a computer system s: 

c = P(f,s) 

For example, in our research on machine-learning based 
optimization, vector f includes semantic or static program 
features [29], [30], [34], [40], data set features and hardware 
counters [35], [22], system configuration, and run-time 
environment parameters among many others. However, when 
trying to implement practical and industrial scenarios in 
cTuning framework, we spent most of our time on engineering 
issues trying to expose characteristics, choices, features and 
system state using numerous, “black box” and not necessarily 
documented tools. Furthermore, when colleagues with a 
machine learning background were trying to help us improve 
optimization predictions, they were often quickly demotivated 
when trying to understand our terminology and problems. 

The Collective Mind approach helped our colleagues solve 
this problem by formalizing the problem and gradually 
exposing characteristics b, choices c, system state s and 
features f (meta information) in experimental setups using 
JSON format as shown in the following real example: 

{’ characteristics 

”executionjimes”: [” 10.3”,” 10.1 ”,”13.3”], 

”code_size”: ”131938”, ...}, 


” choices ”:{ 

”os”:”linux”, ”os_version”:”2.6.32-5-amd64”, 

”compiler”:”gcc”, ”compiler_version”:”4.6.3”, 

”compiler Jlags ”: ”-03 -fno-if-conversion ”, 

”platform”:{” 

”processor”: ”intel_xeon_e5520”, ”12”: ”8192 ”, 

”memory”:”24” ...}, ...}, 

’ features ”:{ 

” semantic ^features”: {” number_of_bb”: ”24”, ...}, 

”hardware_counters”: {”cpi”: ”1.4” ...}, ... } 

’’state” :{ 

’frequency”:”2.27”, ...} 

} 

Furthermore, we can easily convert JSON hierarchical 
data into a flat vector format to apply above mathematical 
formalization of auto-tuning and learning problem while 
making it easily understandable to an interdisciplinary 
community particularly with a background in mathematics 
and physics. In our flat format, a flat key can reference any 
key in a complex JSON hierarchy as one string. Such flat key 
always starts with # followed by #key if it is a dictionary key 
or @position_in_a_list if it is a value in a list. For example, 
flat key for the second execution time ”10.1” in one of the 
previous examples of information flow can be referenced as 
”##characteristics#executionJime@l Finally, users can 
gradually provide the following cM data specification for 
the flat keys in information flow to fully automate program 
optimization and learning (kept together with a given cM 
module): 

” flattenedJson_k ”:{ 

”type”: ’’text” | ’’integer” | ’’float” | ’’diet” | ’’list” | ”uid”, 

”characteristic”: ”yes” | ”no”, 

’feature”: ”yes” | ”no”, 

’’state”: ”yes” | ”no”, 

”has_choice”: ”yes” | ”no”, 

’’choices”: /list of strings if categorical choice’7, 

”explore_start”: ’’start number if numerical range”, 

”explore_stop ”: ’’stop number if numerical range”, 

”explore_step ”: ’’step if numerical range”, 

”can_be_omitted”: ”yes” | ”no”, 

”default_value”: ’’string” 

} ’ 

Of course, such format may have some limitations, but 
it supports well our current research and experimentation 
on auto-tuning and will be extended only when needed. 
Furthermore, such implementation allowed us and our 
colleagues to collaboratively prototype, validate and improve 
various auto-tuning and learning scenarios simply by chaining 
available cM modules similar to components and filters in 
electronics (cM experimental pipelines) and reusing all shared 
artifacts. For example, we converted our ad-hoc build and 
run scripts from cTuning framework to a unified cM pipeline 
consisting of chained cM modules as shown in Figure 6. This 
pipeline ( ctuning.pipeline.build_and_run ) is implemented and 
executed as any other cM module to help researchers simplify 
the following operations during experimentation: 
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In the next section we show how we can reuse 
and customize this pipeline (demonstrated online 
at c-mind.org/ctuning-pipeline ) to systematize and run 
some existing auto-tuning scenarios from our industrial 
partners. 

IV. Systematizing auto-tuning and learning 
scenarios 

Unified cM build and run pipeline combined with 
mathematical formalization allows researchers and engineers 
to focus their effort on implementing and extending universal 
auto-tuning and learning scenarios rather than hardwiring 
them to specific systems, compilers, optimizations or tuned 
characteristics. This, in turn, allows to distribute long tuning 
process across multiple users while potentially solving an 
old and well-known problem of using a few possibly 
non-representative benchmarks and a limited number of 
architectures when developing and validating new optimization 
techniques. 


Fig. 6. Unified build and run cM pipeline implemented as chained 
cM modules. 


• Source to source program transformation and 
instrumentation (if required). For example, we 
added support for PLUTO polyhedral compiler to 
enable automatic restructuring and parallelization of 
loops [48]. 

• Compilation and execution of any shared programs (real 
applications, benchmarks and kernels) using code.source 
module. Meta-description of these programs includes 
information about how to build and execute them. Code 
can be executed with any shared data set as an input 
(<dataset module). The community can gradually share 
more data sets together with the unified descriptions of 
their features such as dimensions of images, sizes of 
matrices and so on. 

• Testing of measured characteristics from repeated 
executions for normal distribution [49] using shared 
cM module ctuning.filter variation to be able to 
expose unusual behavior in a reproducible way 
to the community for further analysis. Unexpected 
behavior often means that some feature is missing 
in the experimental pipeline such as frequency or 
cache contention that can be gradually added by the 
community to separate executions with different contexts 
as described further in Section V. 

• Applying Pareto frontier filter [50], [19], [40] to 
leave only optimal solutions during multi-objective 
optimization when multiple characteristics have to be 
balanced at the same time such as execution time vs 
code size vs power consumption vs compilation time. 
This, in turn, can help to avoid collecting large amounts 
of off-line and often unnecessary experimental data that 
can easily saturate repositories and make data analysis 
too time consuming or even impossible (as happened 
several times with a public cTuning repository) . 


Gradually extend cM build 
and run pipeline module 


Gradually expose Gradually expose design and 

characteristics optimization choices, features 


Select algorithm 

1 


(time) productivity, 
variable-accuracy, 
complexity ... 

Analyze and 
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model type ... 


Fig. 7. Gradual and collaborative top-down decomposition 
of computer system software and hardware using cM modules 
(wrappers) similar to methodology in physics. First, coarse-grain 
design and optimization choices and features are exposed and tuned, 
and later more fine-grain choices are exposed depending on the 
available tuning time budget and expected return on investment. 

Furthermore, it is now possible to take advantage of 
mature interdisciplinary methodologies from other sciences 
such as physics and biology to analyze and learn the 
behavior of complex systems. Therefore, cM uses a top-down 
methodology to decompose software and hardware into simple 
sub-components to be able to start learning and tuning of a 
global, coarse-grain program behavior with respect to exposed 
coarse-grain tuning choices and features. Later, depending 
on user requirements, time budget and expected return on 













































investment during optimization, the community can extend 
components to cover finer-grain tuning choices and behavior 
as conceptually shown in Figure 7. Note, that when analyzing 
a application at a finer-grain levels such as code regions, we 
consider them as interacting cM components with their own 
vectors of tuning choices, characteristics, features and internal 
states. In doing so, we can analyze and learn their behavior 
using methodologies from quantum mechanics or agent-based 
modeling [51]. 

A. Unifying design and optimization space exploration 

As the first practical usage scenario, we developed a 
universal and customizable design and optimization space 
exploration as cM module ctuning. scenario, exploration on 
top of ctuning.pipeline.build_and_run_program module to 
substitute most ad-hoc tuning scripts and frameworks from 
our past research. This scenario can be executed from the 
command line as any other cM module thus enabling relatively 
easy integration with third-party tools including compiler 
regression buildbots or Eclipse-based framework. However, the 
most user friendly way to run scenarios is through the cM web 
interface as demonstrated at c-mind.org/ctuning-exploration 
(note that we plan to improve the usability of this interface 
with dynamic HTML, JavaScript and Ajax technology [52] 
while hiding unnecessary information from users and avoiding 
costly page refreshes). In such a way, cM will query all 
chained modules for this scenario to automatically visualize all 
available tuning choices, characteristics, features and system 
states. cM will also preset all default values (if provided by 
specification) while allowing a user to select which choices to 
explore, characteristics to measure, search strategy to use, and 
statistical analysis for experimental results to apply. 

We currently implemented and shared uniform random 
and exhaustive exploration strategies. We also plan to add 
adaptive, probabilistic and hill climbing sampling from our 
past research [17], [21] or let users develop and share any other 
universal strategy which is not hardwired to any specific tool 
but can explore any available choices exposed by the scenario. 

Next, we present several practical and industrial auto-tuning 
scenarios using above customized exploration module. 

B. Systematizing compiler benchmarking 

Validating new architecture designs across multiple 
benchmarks, tuning optimization heuristics of multiple 
versions of compilers, or tuning compiler flags for a customer 
application is a tedious, time consuming and often ad-hoc 
process that is far from being solved. In fact, it becomes even 
tougher with time due to the ever rising number of available 
optimizations (Figure 1) and many strict requirements placed 
on compilers such as generating fast and small code for all 
possible existing architectures within a reasonable amount of 
time. 

Collective Mind helps unify and distribute this process 
among many machines as a performance tracking 
buildbot. For this purpose, we customized universal 
cM exploration module for compiler flag tuning as 
a new ctuning. scenario, compiler, optimizations module. 


Users just need to choose a compiler version and 
related description of flags (example is available at 
c-mind.org/ctuning-compiler-desc) as an input and select 
either to explore a compiler flag optimization space for a 
given program or distribute tuning of a default compiler 
optimization heuristic across many machines using a set of 
shared benchmarks. Note, that it is possible to use Collective 
Mind not only on desktop machines, servers, data centers 
and cloud services but also on bare metal hardware or 
Android-based mobile devices (either through SSH or using a 
special Collective Mind Node application available in Google 
Play Store [53] to help deploy and crowdsource experiments 
on mobile phones and tablets while aggregating results in 
web-based cM repositories). 



Fig. 8. Compiler flag auto-tuning to improve execution time and code 
size of a shared image corner detection program with a fixed data 
set on Samsung Galaxy Series mobile phone using cM for Android. 
Highlighted points represent frontier of optimal solutions as well as 
GCC with -03 and -Os optimization flags versus LLVM with -03 
flag ( c-mind . org/interactive-graph-demo). 

To demonstrate this scenario, we optimized a real image 
corner detection program on a commodity Samsung Galaxy 
Series mobile phone with ARMv6 830MHz processor using 
Sourcery GCC v4.7.2 with randomly generated combinations 
of compiler flags of format -03 -f(no-OptimizationJlag 
-parameter param=random_number_from_range , LLVM v3.2 
with -03 flag, and a chained Pareto frontier filter (cM 
module ctuning.filter.frontier) for multi-objective optimization 
(balancing execution time, code size and compilation time). 

Experimental results during such exploration (cM module 
output) are continuously recorded in a repository in a unified 
flat vector format making it possible to immediately take 
advantage of numerous and powerful public web services for 
visualization, data mining and analytics (for example from 
Google, Microsoft, Oracle and IBM) or available as packages 
for Python, Weka, MATLAB, SciLab, and R. For example, 
Figure 8 shows 2D visualization of these experimental results 
using public Google Web Services integrated with cM. Such 
interactive graphs are particularly useful when working in 
workgroups or for interactive publications (as demonstrated 
at c-mind.org/interactive-graph-demo). 

Note, that we always suggest to run optimized code several 
times to check variation and test distribution for normality 
as we used to do in physics and electronics. If such a test 
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fails or the variation of any characteristic dimension is more 
than some threshold (currently set as 2%), we do not skip 
such case but record it as suspicious including all inputs and 
outputs for further validation and analysis by the community 
as described in Section V. At the same time, using a Pareto 
frontier filter allows users to easily select the most appropriate 
solution depending on the further intended usage of their 
applications, i.e. the fastest variant if used for HPC systems, 
the smallest variant if used for embedded devices with very 
limited resources, such as credit card chips or the future 
“Internet of Things” devices, or balanced for both speed and 
size when used in mobile phones and tablets. 

Since Collective Mind also enables co-existence of multiple 
versions of different compilers, checks output of programs for 
correct execution during optimization, and supports multiple 
shared benchmarks and data sets, it can be easily used as 
a distributed and public buildbot for rigorous performance 
tracking and simultaneous tuning of compilers (as shown in 
Figure 2) while taking advantage of a growing number of 
shared benchmarks and data sets [54]. Longer term, we expect 
that such an approach will help the community fully automate 
compiler tuning for new architectures or even validate new 
processor designs for errors. It can also help derive a realistic, 
diverse and representative training set of benchmarks and data 
sets [24] to systematize and speed up training for machine 
learning based optimization prediction for previously unseen 
programs and architectures [40]. 

To continue this collaborative effort, we shared the 
description of all parametric and boolean (on or off) 
compiler flags in JSON format as “choices” for a number of 
popular compilers including GCC, LLVM, Open64, PathScale, 
PGI and ICC under ctuning. compiler module. We also 
implemented and shared several off-the-shelf classification 
and predictive models including KNN and SVM from our 
past research [34], [35], [40] using math.model module to 
be able to automatically predict better compiler optimization 
using semantic and dynamic program features. Finally, we 
started implementing standard complexity reduction and 
differential analysis techniques [55], [56] in cM to iteratively 
isolate unusual program behavior [57] or to find minimal 
set of representative benchmarks, data sets and correlating 
features [24], [42]. Users can now collaboratively analyze 
unexpected program behavior, improve predictive models, find 
best tuning strategies and collect minimal set of influential 
optimizations, representative features, most accurate models, 
benchmarks and data sets. 

C. Systematizing modeling of application behavior to focus 
optimizations 

Since programs may potentially have an infinite number 
of data sets while auto-tuning is already time consuming, it 
is usually performed for one or a few and not necessarily 
representative data sets. Collective Mind can help to 
systematize and automate modeling of a behavior of a given 
application across multiple data sets to suggest where to focus 
further tuning (adaptive sampling and online learning) [34], 
[58], [21]. We just needed to customize previously introduced 



Data set feature N (size) exposed by community as cM meta-description 

Fig. 9. Online learning (predictive modeling) of a CPI behavior of a 
shared LU-decomposition benchmark on 2 different platforms (Intel 
Core2 shown in red vs Intel i5 shown in blue) vs vector size N (data 
set feature). 


auto-tuning pipeline to explore data set parameters (already 
exposed through dataset module) and model program behavior 
at the same time using either off-the-shelf predictive models 
including linear regression, Support Vector Machines (SVM), 
Multivariate Adaptive Regression Splines (MARS), and neural 
networks available for R language and abstracted by cM 
module math.model.r , or shared user-defined hybrid models 
specific for a given application. 

For example, Figure 9 demonstrates how such exploration 
and online learning is performed using cM together with shared 
LU-decomposition benchmark versus size of input vector 
(N), measured CPI characteristic, and 2 Intel-based platforms 
(Intel Core2 Centrino T7500 Merom 2.2GHz L1=32KB 8-way 
set-associative, L2=4MB 16-way set associative - red dots vs. 
Intel Core i5 2540M 2.6GHz Sandy Bridge L1=32KB 8-way 
set associative, L2=256KB 8-way set associative, L3=3MB 
12-way set associative - blue dots). 

In the beginning, cM does not have any knowledge about 
behavior of this (or any other) benchmark, so it simply 
observes and stores available characteristics along with the 
data set features. At each exploration (sampling) step, cM 
processes all historical observations using various available or 
shared predictive models such as SVM or MARS in order to 
find correlations between data set features and characteristics. 
At the same time it attempts to minimize Root-Mean-Square 
Deviation (RMSE) between predicted and measured values for 
all available points. Even if RMSE is relatively low, cM can 
continue exploring and observing behavior in order to detect 
discrepancies (failed predictions). 

Interestingly, in our example, practically no off-the-shelf 
model could detect the A outliers (singularities) which 
appear due to cache alignment problems. However, having 
mathematical formalization helps interdisciplinary community 
to find and share better models that minimized RMSE and 
model size at the same time. In the presented case, our 
colleagues from machine learning department managed to fit 
and share a hybrid, parameterized, rule-based model that first 
validates cases where data set size is a power of 2, otherwise 
it uses linear models as functions of a data set and cache size. 
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Fig. 10. CPI behavior of a matrix-matrix multiply benchmark 
(CID=45741 e3fbcf4024b: 116a9c375e7d7e 14) on Intel i5 platform 
vs different matrix sizes. Hyperplanes separate areas with similar 
behavior found using multivariate adaptive regression splines 
(MARS). 

This model resembles reversed analytical roofline model [59] 
though is continuously and empirically refined to capture even 
fine-grain effects. In contrast, standard MARS model managed 
to predict the behavior of a matrix-matrix multiplication kernel 
for different matrix sizes as shown in Figure 10. 

Such models can help focus auto-tuning on areas with 
distinct behavior as described in [17], [58], [21]. For example 
presented in Figure 9, outlier points A can be optimized 
using array padding; area B can profit from parallelization and 
traditional compiler optimizations targeting ILP; areas C-E can 
benefit from loop tiling; points A saturate memory bus and can 
also benefit from reduced processor frequency to save energy. 
Such optimizations can be performed automatically if exposed 
through cM or provided by the community as shared advices 
using ctuning.advice module. 

In the end, multiple customizable models can be shared 
as parameterized cM modules along with applications thus 
allowing the community to continuously refine them or 
even reuse them for similar classes of applications. Finally, 
such predictive models can be used for effective and online 
compaction of experiments while avoiding collection of a large 
amount of data (known in other fields as a “big data” problem) 
and leaving only representative or unexpected behavior. It 
can, in turn, minimize communications between cM nodes 
while making Collective Mind a giant and distributed learning 
and decision making network to some extent similar to the 
brain [42]. 

D. Enabling fine-grain auto-tuning through plugins 

After learning and tuning coarse-grain behavior, we 
gradually move to finer-grain levels including selected 
code regions, loop transformations, MPI parameters and 
so on, as shown in Figure 7. However, in our past 
research, it required messy instrumentation of applications and 
development of complex source-to-source transformation tools 
and pragma-based languages. 



Fig. 11. Conceptual structure of compilers supporting plugin and 
event based interface to enable fine-grain tuning and learning of their 
internal and often hidden heuristics through external plugins [38]. 


As an alternative and simpler solution, we developed 
an event-based plugin framework (Interactive Compilation 
Interface and was recently substituted by a new and universal 
OpenME plugin-based framework connected to cM) to 
expose tuning choices and semantic program features from 
production compilers such as GCC and LLVM through 
external plugins [60], [43], [38]. This plugin-based tuning 
technique helped us to start unifying, cleaning up and 
converting rigid compilers into powerful and flexible research 
toolsets. Such an approach also helped companies and 
end-users to develop their own plugins with customized 
optimization and tuning scenarios without rebuilding compilers 
and instrumenting applications thus keeping them clean and 
portable. 

This framework also allowed to easily expose multiple 
semantic code features to automatically learn and improve 
all optimization and tuning decisions using standard machine 
learning techniques as conceptually shown in Figure 11. 
This plugin-based tuning technique was successfully used 
in the MILEPOST project to automate online learning and 
tuning of the default optimization heuristic of GCC for 
new reconfigurable processors from ARC during software 
and hardware co-design [40]. The plugin framework was 
eventually added to mainline GCC since version 4.6. We 
are gradually adding it to cM to support plugin-based 
selection and ordering of internal compiler passes, tuning 
and learning of internal compiler decisions, and aggregation 
of semantic program features in a unified format using 
ctuning.scenario.program.features.milepost module. 

Plugin-based static compilers can help users automatically 
or interactively tune a given application with a given data 
set for a given architecture. However, different data sets or 
run-time system state often require different optimizations and 
tuning parameters that should be dynamically selected during 
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Fig. 12. PTF plugin-based application online tuning framework. 


execution. Therefore, Periscope Tuning Framework (PTF) [26] 
was designed to enable and automate online tuning of parallel 
applications using external plugins with integrated tuning 
strategies. Users need to instrument application to expose 
required tuning parameters and measured characteristics for 
a given application. At the same time, tuning space can be 
considerably reduced inside such plugins per given application 
using previous compiler analysis or expert knowledge about 
typical performance bottlenecks and ways to detect and 
improve them as conceptually shown in Figure 12. 

Once the online tuning process is finished, PTF generates 
a report with the recommended tuning actions which can be 
integrated either manually or automatically into the application 
for further production runs. Currently, PTF includes plugins 
to tune execution time of high-level parallel kernels for 
GPGPUs, balance energy consumption via CPU frequency 
scaling, optimize MPI runtime parameters among many other 
scenarios in development. 

Collective Mind can help PTF distribute tuning of shared 
benchmarks and data sets among many users, aggregate 
results in a common repository, apply data mining and 
machine learning plugins to prune tuning spaces, and automate 
prediction of optimal tuning parameters. PTF and cM can 
also complement each other in terms of tuning coverage since 
cM currently focuses on global, high-level, machine-learning 
guided optimizations and compiler tuning while PTF currently 
focuses on finer-grain online application tuning. In our future 
work we plan to connect PTF and cM together using cM 
OpenME interface. 



Fig. 13. Making run-time adaptation and tuning practical using static 
multi-versioning, features exposed by users or automatically detected, 
and predictive modeling (decision trees) while avoiding complex 
dynamic recompilation frameworks. 


depending on data set features, target architecture features 
and a system state [61], [24], [62], [22]. However, since this 
approach still requires a long and off-line training phase, we 
can now use Collective Mind to systematize off-line tuning 
and learning of a program behavior across many data sets and 
computer systems as conceptually shown in Figure 13. 

#ifdef OPENME 

openme_callback("KERNEL_START", NULL); 

#endif 

if (adaptive_select==0) mm2_cpu(A, B, C, D, E); 
elif (adaptive_select==l) cl_launch_kernel(A,B,C,D,E); 
elif (adaptive_select==2) mm2Cuda(A, B, C, D, E, 

E_outputFromGpu); 

#ifdef OPENME 

openme callbackf'KERNEL END", NULL); 
openme_callback("PROGRAM_START", NULL); #endjf " 

#endif 


2mm.c / 2mm.cu 

#ifdef OPENME 
#include <openme.h> 
#endif 


int main(void) { 

#ifdef OPENME 

openme_init(NULL,NULL,NULL,0); 


#ifdef OPENME 

#ifdef OPENME openme callbackf'PROGRAM END", NULL); 

openme_callback("SELECT_KERNEL", &adapt); #endjf " 

#endif 

} 

Fig. 14. Example of predictive scheduling of matrix-matrix 
multiplication kernel for heterogeneous architectures using OpenME 
interface and statically generated kernel clones with different 
algorithm implementations and optimizations to find the winning one 
at run-time. 


E. Systematizing split compilation and adaptive scheduling 

Many current online auto-tuning techniques have a 
limitation - they usually do not support arbitrary online code 
restructuring unless complex just-in-time (JIT) compilers are 
used. As a possible solution to this problem, we introduced 
split compilation to statically enable dynamic optimizations 
and adaptation by cloning hot functions or kernels during 
compilation and providing run-time selection mechanism 


Now, users can take advantage of continuously collected 
knowledge about program behavior and optimization in 
the repository to derive a minimal set of representative 
optimizations or tuning parameters covering application 
behavior across as many data sets and architectures as 
possible [24]. Furthermore, it is now possible to reuse machine 
learning techniques from cM to automatically derive small 
and fast decision trees needed for realistic cases shown in 
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Figures 3, 9 and 10. Such decision trees can now be integrated 
with the application through OpenME or PTF plugins to 
dynamically select appropriate clones and automatically adapt 
for heterogeneous architectures particularly in supercomputers 
and data centers, or even execute some external tools to 
reconfigure architecture (change frequency, for example) based 
on exposed features to minimize execution time, power 
consumption and other user objectives. These data set, 
program and architecture features can also be exposed through 
plugins either automatically using OpenME-based compilers or 
manually through application annotation and instrumentation. 

OpenME was designed especially to be very easy to use 
for researchers and provide a simple connection between 
Python-based Collective Mind and other modules or plugins 
written in other languages including C, C++, Fortran and Java. 
It has only two functions to initialize an event with an arbitrary 
string name, and to call it with a void type argument that will be 
handled by a user plugin and can range from a simple integer to 
a cM JSON dictionary. However, since such implementation of 
OpenME can be relatively slow, we use fast Periscope Tuning 
Framework for fine-grain tuning. Possible example of such 
implementation for predictive scheduling of matrix multiply 
using OpenME interface and several clones for heterogeneous 
architectures [22] is presented in Figure 14. 

Our static function cloning approach with dynamic 
adaptation was recently added to mainline GCC since 
version 4.8. We hope that together with OpenME, PTF and 
cM, it will help systematize research on split compilation 
while focusing on finding and exposing the most appropriate 
features to improve run-time adaptation decisions [24] using 
recent advances in machine learning, data mining and decision 
making [63], [64], [65], [66]. 

F. Automating benchmark generation and differential analysis 

Our past research on machine learning to speed up 
auto-tuning suffered from yet another well-known problem: 
lack of large and diverse benchmarks. Though Collective Mind 
helps share multiple programs and data sets including ones 
from [23], [67], [40], it may still not be enough to cover 
all possible program behavior and features. One possibility 
is to generate many synthetic benchmarks and data sets but 
it always result in explosion in tuning and training times. 
Instead, we propose to use Alchemist plugin [42] together 
with plugin-enabled compilers such as GCC to use existing 
benchmarks, kernels and even data sets as templates and 
randomly modify them by removing, modifying or adding 
various instructions, basic blocks, loops and so on. Naturally, 
we can ignore crashing variants of the code and continue 
evolving only the working ones. 

We can use such an approach not only to gradually extend 
realistic training sets, but also to iteratively identify various 
behavior anomalies or detect missing code features to explain 
unexpected behavior similar to differential analysis from 
electronics [55], [56]. For example, we are adding support to 
Alchemist plugin to iteratively scalarize memory accesses to 
characterize code and data set as CPU or memory bound [57], 
[47]. Its prototype was used to obtain line X in Figure 9 


showing ideal code behavior when all floating point memory 
accesses are NOPed. Additionally, we use Alchemist plugin to 
unify extraction of code structure, patterns and other features 
to collaboratively improve prediction during software/hardware 
co-design [36]. 

V. Enabling and improving reproducibility of 

EXPERIMENTAL RESULTS 


Missing cM feature to separate experiments: CPU frequency 



Execution time (sec.) 


Fig. 15. Unexpected behavior helped to identify and add a missing 
feature to cM (processor frequency) as well as software dependency 
(cpufreq) that ensures reproducibility of experimental results. 

Since cM allows to implement, preserve and share the 
whole experimental setup, it can also be used for reproducible 
research and experimentation. For example, unified module 
invocation in cM makes it possible to reproduce (replay) 
any experiment by saving JSON input for a given module 
and an action, and comparing JSON output. At the same 
time, since execution time and other characteristics often 
vary, we developed and shared cM module that applies 
Shapiro-Wilk test from R to test monitored characteristic 
for normality. However, in contrast with current experimental 
methodologies where results not passing such test are simply 
skipped, we record them in a reproducible way to find 
and explain missing features in the system. For example, 
when analyzing multiple executions of image corner detection 
benchmark on a smart phone shown in Figure 8, we noticed 
an occasional 4x difference in execution times as shown 
in Figure 15. Simple analysis showed that our phone was 
often in the low power state at the beginning of experiments 
and then gradually switched to the high-frequency state 
(4x difference in frequency). Though relatively obvious, 
this information allowed us to add CPU frequency to 
the build and run pipeline using cpufreq module and 
thus separate such experiments. Therefore, Collective Mind 
research methodology can gradually improve reproducibility 
as a side effect and with the help of the community rather 
than trying to somehow enforce it from the start. 

VI. Conclusions and future work 

This paper presents our novel, community-driven approach 
to make auto-tuning practical and move it to mainstream 
production environments. However, rather than searching for 
yet another “holy grail” auto-tuning technique, we propose to 
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start preserving, sharing and reusing already available practical 
knowledge and experience about program optimization and 
hardware co-design using Collective Mind framework and 
repository. Such approach helps researchers and engineers 
quickly prototype and validate various auto-tuning and 
learning techniques as plugins connected into experimental 
pipelines while reusing all shared artifacts. Such pipelines 
can be distributed among many users to collaboratively 
learn, model and tune program behavior using standard 
top-down methodology from mature sciences such as 
physics by decomposing complex software into interconnected 
components while capturing first coarse-grain effects and later 
moving to finer-grain levels. At the same time, any unexpected 
behavior and optimization mispredictions are exposed to 
the community in a reproducible way to be explained 
and improved. Therefore, we can collaboratively search for 
profitable optimizations, efficient auto-tuning strategies, truly 
representative benchmarks, and most accurate models to 
predict optimizations together with minimal set of relevant 
semantic and dynamic features. 

Our future collaborative work includes exposing more 
tuning dimensions, characteristics and features using 
Collective Mind and Periscope tuning frameworks to 
eventually tune the whole computer system while extrapolating 
collected knowledge to build faster, more power efficient and 
reliable self-tuning computer systems. We are working with 
the community to gradually unify existing techniques and tools 
including pragma-based source-to-source transformations [68], 
[69], plugin-based GCC and LLVM to expose and tune all 
internal optimization decisions [40], [42]; polyhedral 
source-to-source transformation tools [48]; differential 
analysis to detect performance anomalies and CPU/memory 
bounds [57], [47]; just-in-time compilation for Android 
Dalvik or Oracle JDK; algorithm-level tuning [70]; techniques 
to balance communication and computation in numerical 
codes particularly for heterogeneous architectures [71], [27]; 
Scalasca framework to automate analysis and modeling of 
scalability of HPC applications [72], [73]; LIKWID for 
light-weight collection of hardware counters [74]; HPCC and 
HPCG benchmarks to collaboratively rank HPC systems [75], 
[76]; benchmarks from GCC and LLVM, TAU performance 
tuning framework [77]; and all recent Periscope application 
tuning plugins [25], [26]. 

At the same time we plan to use collected and unified 
knowledge to improve our past techniques on decomposition 
of complex programs into interconnected kernels, predictive 
modeling of program behavior, and run-time tuning and 
adaptation [61], [51], [24], [78], [58], [22], [79], [80], 
[81]. Finally, we are extending Collective Mind to assist 
recent initiatives on reproducible research and new publication 
models in computer engineering where all experimental results 
and related research artifacts with all dependencies are 
continuously shared along with publications to be validated 
and improved by the community [82]. 
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